In this session we explore some of the tools offered by the igraph package in R for network analysis. These treat networks from a network science perspective, rather than from a more transportation-oriented perspective.
As usual, we need some libraries
library(tmap)
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
library(igraph) # you'll need to install this
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(tidygraph) # you'll need to install this
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:stats':
##
## filter
library(ggraph) # you'll need to install this
## Loading required package: ggplot2
Also, some data. You’ll find what follows in this zip file.
I’ve made both a ‘geospatial’ and a ‘network-oriented’ version of the data.
Here are the geospatial layers.
intersections <- read_sf('network/nodes/nodes.shp')
road_segments <- read_sf('network/edges/edges.shp')
It is important to look at the nodes and edges in their geographical context.
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(road_segments) +
tm_lines(col='orange') +
tm_shape(intersections) +
tm_dots(col='red', size=0.005)
Here you see how the network representation treats is concerned both with nodes (or vertices) and the connections (or edges) between them.
The graph version of these data is in a single graphml file, which we load using the igraph function read_graph.
G <- read_graph('network/network.graphml', format='graphml')
This file (which you can examine in a text editor) is in the graphML format, and includes information about both nodes and edges in a single combined file. This is a relatively commonly used format for exchanging network data among programs that perform graph analysis, such as Gephi.
I built all three of these datasets using the excellent osmnx package developed by Geoff Boeing.
Now… graphs are rather complicated things. They effectively consist of two sets of entities, the nodes and the edges. Further, each edge consists of two nodes that it connects. To see this do
G
## IGRAPH 367edf3 D--- 1114 2471 -- unnamed_UTM
## + attr: streets_per_node (g/c), name (g/c), crs (g/c), lat (v/n), lon
## | (v/n), y (v/n), x (v/n), osmid (v/c), highway (v/c), id (v/c),
## | service (e/c), junction (e/c), tunnel (e/c), access (e/c), ref (e/c),
## | bridge (e/c), lanes (e/c), geometry (e/c), length (e/c), maxspeed
## | (e/c), highway (e/c), name (e/c), oneway (e/c), osmid (e/c), id (e/c)
## + edges from 367edf3:
## [1] 1-> 493 1-> 4 1-> 877 2-> 794 2-> 714 3-> 9 3->1114 3-> 966
## [9] 4-> 1 5-> 878 5-> 99 5-> 6 6-> 5 7-> 879 8-> 148 8-> 354
## [17] 9-> 3 9-> 11 9-> 727 10-> 953 11-> 9 12-> 953 12-> 415 12-> 10
## [25] 13-> 148 14-> 457 14-> 15 14-> 512 15-> 16 15-> 14 15-> 810 16-> 142
## + ... omitted several edges
What you are looking at is a rather concise summary of the graph object. It has 1114 nodes and 2471 edges. There are a number of vertex attributes (tagged v/c) such as lat, lon, y, x, and also edge attributes (tagged e/c) including a geometry. Unfortunately, the igraph package is not very intuitive to work with.
A recently released packages helps us out here by allowing us to see the igraph object in a more ‘tabular’ way.
G <- as_tbl_graph(G)
G
## # A tbl_graph: 1114 nodes and 2471 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 1,114 x 7 (active)
## lat lon y x osmid highway id
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 -41.3 175. 5424646. 1749430. 301920256 nan 301920256
## 2 -41.3 175. 5425957. 1746838. 290308097 nan 290308097
## 3 -41.3 175. 5426484. 1747580. 274522122 nan 274522122
## 4 -41.3 175. 5424663. 1749513. 301920267 nan 301920267
## 5 -41.3 175. 5424982. 1749447. 301920269 nan 301920269
## 6 -41.3 175. 5424773. 1749489. 301920270 nan 301920270
## # … with 1,108 more rows
## #
## # Edge Data: 2,471 x 17
## from to service junction tunnel access ref bridge lanes geometry length
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 493 "" "" "" "" "" "" "" "" 50.9
## 2 1 4 "" "" "" "" "" "" "" "" 84.978
## 3 1 877 "" "" "" "" "" "" "" "LINEST… 91.923
## # … with 2,468 more rows, and 6 more variables: maxspeed <chr>, highway <chr>,
## # name <chr>, oneway <chr>, osmid <chr>, id <chr>
This is a bit more readable, and helps us to see what we are dealing with.
Better yet would be a drawing! This turns out to be somewhat fiddly also. But here is an example.
ggraph(G, layout='igraph', algorithm='nicely') +
geom_node_point(colour='red', size=0.5) +
geom_edge_link(colour='grey', width=0.5) +
coord_equal() +
theme_graph()
## Warning: Existing variables `x`, `y` overwritten by layout variables
What the heck is that?! It’s actually the same network that we saw in the map view above, but with the road segments joining nodes represented now as straight lines. This is the essence of the street network connectivity, without the complication of all the twists and turns of the roads themselves. This may not impress you very much, but once we allow ourselves to ignore the geographical detail, we can start to see the structure of the network more clearly. One way to do this is with different graph drawing algorithms.
One example is the multidimensional scaling algorithm, which attempts to place nodes so that their positions relate to their distances from one another in network space.
ggraph(G, layout='igraph', algorithm='mds') +
geom_edge_link(colour='grey', width=0.5) +
geom_node_point(colour='red', size=0.5) +
coord_equal() +
theme_graph()
## Warning: Existing variables `x`, `y` overwritten by layout variables
Redrawing the graph is not very useful unless you really understand its structure. There are a number of categories of graph analysis method that can help us with this.
The centrality of a node or edge in a graph is a measure of its importance in the network sructure in some sense. This can be as simple as how many nodes a node is connected to (more is more central), although this tends not to be very interesting in a road network.
intersections$centrality <- degree(G, mode='in')
tm_shape(road_segments) +
tm_lines() +
tm_shape(intersections) +
tm_dots(col='centrality', style='cat')
A more interesting option is betweenness centrality. This determines the centrality of nodes based on how often they are found on the shortest paths in the network from every location to every other location. This approach often highlights choke points in a network.
intersections$centrality <- betweenness(G)
tm_shape(road_segments) +
tm_lines() +
tm_shape(intersections) +
tm_dots(col='centrality')
Additional methods are closeness which determines on average which nodes are closest to all others, and page_rank which uses a complex matrix analysis method, identical to Google’s pagerank algorithm, based on random walks on the network (for this one, you have to use centrality <- page_rank(G)$vector to extract the values.
Give those two a try (there are many more!) to get a feel for things.
The connectivity structure of a network may mean that there are distinct regions within it that are relatively cut off from one another while being well connected internally. In network science these regions are known as communities and many algorithms have been developed to perform community detection (this is analagous to cluster detection in multivariate data). Many of these only work on undirected graphs, so before trying them we will open an undirected version of the graph we have been looking at.
UG <- read_graph('network/network_.graphml', format='graphml')
intersections$community <- as.factor(membership(cluster_louvain(UG)))
tm_shape(road_segments) +
tm_lines() +
tm_shape(intersections) +
tm_dots(col='community', style='cat')
In a relatively well connected network, it is surprising that community detection can work at all (since everywhere is pretty much connected to everywhere else!) but even so the groupings identified by this method are interesting to explore and relate to our understanding of the geography of central Wellington. There are quite a few other methods available, but my experiments suggest that for these data only cluster_louvain, cluster_spinglass and cluster_edge_betweenness have much luck. Give each a try, and see what you think. Is there any way to determine which partition is the most ‘correct’?
Having determined some communities of possible interest, it may be instructive to visualize these in the network space, determined by network structure.
ggraph(G, layout='igraph', algorithm='mds') +
geom_edge_link(colour='grey', width=0.5) +
geom_node_point(aes(colour=intersections$community)) +
coord_equal() +
theme_graph()
## Warning: Existing variables `x`, `y` overwritten by layout variables
Hopefully this has given you a taste of how different the world can look when we start to consider not just simple Euclidean distances among things, but to consider their connectivity in other ways. It is worth emphasizing that a street network (even in Wellington!) is not very different to an open space in terms of the connection it affords between places. When we consider much less geographically coherent networks (such as airlines, Internet, etc.) then the ‘space’ opened up for exploration can be very different indeed.